A Very Large Scale Mandarin Chinese Broadcast Collection for the GALE Program
Abstract
In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by the Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC); the latter covers talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of the BC and BN recordings is manually transcribed in standard Chinese characters with UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST's acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.
Similar resources
Speech retrieval of Mandarin broadcast news via mobile devices
This paper presents a system for speech retrieval of Mandarin broadcast news. First, several data-driven and unsupervised approaches are integrated into the broadcast news transcription system to improve the speech recognition accuracy and efficiency. Then, a multi-scale indexing paradigm for broadcast news retrieval is proposed to make use of the special structural properties of the Chinese la...
Speech Retrieval of Mandarin Broadcas
This paper presents a system for speech retrieval of Mandarin broadcast news. First, several data-driven and unsupervised approaches are integrated into the broadcast news transcription system to improve the speech recognition accuracy and efficiency. Then, a multi-scale indexing paradigm for broadcast news retrieval is proposed to make use of the special structural properties of the Chinese la...
MATBN 2002: A Mandarin Chinese Broadcast News Corpus
The MATBN 2002 Mandarin Chinese broadcast news corpus contains a total of 40 hours of broadcast news from Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary motivation for this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast domain. We expect to collect and process 220 hours of Mandarin Chine...
MATBN: A Mandarin Chinese Broadcast News Corpus
The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and r...
Voice retrieval of Mandarin broadcast news speech
This paper presents an improved framework for voice retrieval of Mandarin broadcast news speech. First, several unsupervised and data-driven approaches for broadcast news transcription were proposed to improve the speech recognition accuracy and efficiency. Then, a multiscale indexing paradigm for broadcast news retrieval was exploited to alleviate the problems caused by the speech recognition ...
Publication date: 2010